Tiffany Chan

Unsupervised Learning Assignment Credit Card Segmentation Project

In [430]:
#Import the basic fundamental libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
In [431]:
#Bring in the credit card customer data
ccdata= pd.read_excel('Credit Card Customer Data.xlsx')
  1. Perform univariate analysis on the data to better understand the variables at your disposal and to get an idea about the no of clusters. Perform EDA, create visualizations to explore data. (10 marks)

  2. Properly comment on the codes, provide explanations of the steps taken in the notebook and conclude your insights from the graphs. (5 marks)

In [432]:
#Evaluating the first 5 customer results from the data
ccdata.head()
Out[432]:
Sl_No Customer Key Avg_Credit_Limit Total_Credit_Cards Total_visits_bank Total_visits_online Total_calls_made
0 1 87073 100000 2 1 1 0
1 2 38414 50000 3 0 10 9
2 3 17341 50000 7 1 3 4
3 4 40496 30000 5 1 1 4
4 5 47437 100000 6 0 12 3
In [433]:
#Let's look at the shape of the data and see if there are any missing values. 
#If there are missing values, we can choose to impute them

print("Shape of dataset")
print(ccdata.shape)    #Shape of the data
print("")
print("Number of missing values in each variable") #Checking for missing values
print(ccdata.isnull().sum())
print("")
print("Table of Possible Duplicates in the data")
print(ccdata[ccdata.duplicated(keep='first')])     #Recording which cases are duplicates
print("")
Shape of dataset
(660, 7)

Number of missing values in each variable
Sl_No                  0
Customer Key           0
Avg_Credit_Limit       0
Total_Credit_Cards     0
Total_visits_bank      0
Total_visits_online    0
Total_calls_made       0
dtype: int64

Table of Possible Duplicates in the data
Empty DataFrame
Columns: [Sl_No, Customer Key, Avg_Credit_Limit, Total_Credit_Cards, Total_visits_bank, Total_visits_online, Total_calls_made]
Index: []

In the original dataset, there are in total 660 reported cases, and 7 columns. There are no duplicates in this data, therefore we do not have to perform imputations on any missing data. However, we may have to do imputations if there are extreme values.

In [434]:
#Looking at descriptive statistics. 
ccdata.describe()
Out[434]:
Sl_No Customer Key Avg_Credit_Limit Total_Credit_Cards Total_visits_bank Total_visits_online Total_calls_made
count 660.000000 660.000000 660.000000 660.000000 660.000000 660.000000 660.000000
mean 330.500000 55141.443939 34574.242424 4.706061 2.403030 2.606061 3.583333
std 190.669872 25627.772200 37625.487804 2.167835 1.631813 2.935724 2.865317
min 1.000000 11265.000000 3000.000000 1.000000 0.000000 0.000000 0.000000
25% 165.750000 33825.250000 10000.000000 3.000000 1.000000 1.000000 1.000000
50% 330.500000 53874.500000 18000.000000 5.000000 2.000000 2.000000 3.000000
75% 495.250000 77202.500000 48000.000000 6.000000 4.000000 4.000000 5.000000
max 660.000000 99843.000000 200000.000000 10.000000 5.000000 15.000000 10.000000

Looking at the descriptive statistics of this dataset, we know that the data needs to be standardized, so that certain columns do not get more weight than others. We can see this because the ranges differ. Avg_Credit_Limit has a higher range compared to all the other variables. Standardization helps to even out the weights between the variables. Of course, we should ignore the SI_No and Customer_Key columns because they are identity variables and do not carry quantitative value.

UNIVARIATE ANALYSIS

In [435]:
#Univariate analysis for Avg_Credit_Limit

#Looking at the histogram/distplot and the frequencies of average credit limit in the dataset
print(sns.distplot(ccdata['Avg_Credit_Limit']))             #Histogram/Distplot
print(ccdata['Avg_Credit_Limit'].value_counts())            #Frequencies
AxesSubplot(0.125,0.125;0.775x0.755)
8000      35
6000      31
9000      28
13000     28
10000     26
          ..
25000      1
153000     1
111000     1
112000     1
106000     1
Name: Avg_Credit_Limit, Length: 110, dtype: int64
In [436]:
#Boxplot for Avg_Credit_Limit
print(sns.boxplot(ccdata['Avg_Credit_Limit']))             #Boxplot
AxesSubplot(0.125,0.125;0.775x0.755)

This histogram/distplot shows that this variable is rightly skewed and may have a lot of outliers beyond the upper whisker in the traditional sense. Because we are dealing with k means and several hierarchical linkage methods that may be sensitive to outliers, we should replace their value with the mean, median or whisker value. Let's explore further whether we should proceed with this decision by conducting more EDA:

In [437]:
#We need to see if it would be worth it to impute the outliers in this feature.
#First we need the following measurements first.

q75, q25 = np.percentile(ccdata['Avg_Credit_Limit'], [75 ,25])
iqr = q75 - q25
print(q75)                   #Find 75th percentile
q75 + 1.5*iqr         #Find value at upper whisker in order to set limits to find the outliers.
48000.0
Out[437]:
105000.0
In [438]:
#We need to see if it would be worth it to impute the outliers in this feature.

Outliers = ccdata[(ccdata['Avg_Credit_Limit'] > 105000.0)]
len(Outliers)        #Count the number of outliers and then determine if it's worth imputing
Out[438]:
39

39 out 660 customers is 5.9% and is considered little, and we can therefore impute these numbers with the value of the upper whisker. The reason we should do this is because K means is sensitive to outliers. Average link is also known to be sensitive to outliers. Both methods calculate means, which are highly influenced by outliers.

In [439]:
#Replace the outliers with the upper whisker value

ccdata.Avg_Credit_Limit[ccdata['Avg_Credit_Limit']> 105000.0] = 105000.0
ccdata[['Avg_Credit_Limit']].boxplot()
Out[439]:
<matplotlib.axes._subplots.AxesSubplot at 0x2ab31113d30>

No more outliers in this feature.

In [440]:
#Univariate analysis for Total_Credit_Cards

#Looking at the histogram/distplot and the frequencies of Total_Credit_Cards in the dataset
print(ccdata['Total_Credit_Cards'].value_counts())    #Frequencies
print(sns.distplot(ccdata['Total_Credit_Cards']))     #Distplot/histogram
4     151
6     117
7     101
5      74
2      64
1      59
3      53
10     19
9      11
8      11
Name: Total_Credit_Cards, dtype: int64
AxesSubplot(0.125,0.125;0.775x0.755)
In [441]:
#Boxplot for Total_Credit_Cards
print(sns.boxplot(ccdata['Total_Credit_Cards']))
AxesSubplot(0.125,0.125;0.775x0.755)

It is clear from the boxplot and the distplot/histogram that there are no outliers beyond the upper and lower whiskers for total number of credit cards. There could possibly be 4 clusters according to the kde shown here, but we are only observing this on a 1-dimensional scale. We will postpone further discussion on potential cluster number when we look at the scaled pairplot later on in this assignment.

In [442]:
#Univariate analysis for Total_visits_bank

#Looking at the histogram/distplot and the frequencies of Total_visits_bank in the dataset
sns.distplot(ccdata['Total_visits_bank'])
ccdata['Total_visits_bank'].value_counts()
Out[442]:
2    158
1    112
3    100
0    100
5     98
4     92
Name: Total_visits_bank, dtype: int64
In [443]:
#Boxplot for Total_vote_bank
sns.boxplot(ccdata['Total_visits_bank'])    #Boxplot
Out[443]:
<matplotlib.axes._subplots.AxesSubplot at 0x2ab312bce50>

It is clear from the boxplot and the distplot/histogram that there are no outliers beyond the upper and lower whiskers for total bank visits.

In [449]:
#Univariate analysis for Total_calls_made

#Looking at the histogram/distplot and the frequencies of Total_calls_made in the dataset

sns.distplot(ccdata['Total_calls_made'])
ccdata['Total_calls_made'].value_counts()
Out[449]:
4     108
0      97
2      91
1      90
3      83
6      39
7      35
9      32
8      30
5      29
10     26
Name: Total_calls_made, dtype: int64
In [450]:
#Boxplot for Total_calls_made
sns.boxplot(ccdata['Total_calls_made'])
Out[450]:
<matplotlib.axes._subplots.AxesSubplot at 0x2ab2ee4a820>

There are no traditional outliers for total calls made in this dataset.

BIVARIATE ANALYSIS AND SCALING

In [451]:
# K Means and machine learning libraries
from sklearn.model_selection  import train_test_split
from sklearn.cluster import KMeans

#Import the following for scaling.
from scipy.stats import zscore 
In [452]:
#Pairplot for bivariate analysis. 
#Use diagonal univariate analysis to count the possible number of clusters for K means.
ccdata1=ccdata.iloc[:,2:]                      #Exclude the ID variables: Customer Key and SI_No
ccdatascaled=ccdata1.apply(zscore)             #For scaling. Make the mean = 0, and sd = 1
sns.pairplot(ccdatascaled,diag_kind='kde')     #Pairplot code, set the diagonal with kde.
Out[452]:
<seaborn.axisgrid.PairGrid at 0x2ab313cc940>

To determine the number of minimum clusters, we have to focus on the diagnonal, which is basically a scaled version of the histogram/distplots with the superimposed kde plotted above. It seems that in evaluating total credit cards, the possible minimum number of clusters could be 4. This is only a possibility. There could definitely be more clusters due to the presence of these other variables/dimensions. We are only observing this using 1-2 dimensions. If we incorporated the other variables, there may be other cluster(s) that are not immediately obvious to the eye in the 2-dimensional space.

  1. Execute K-means clustering use elbow plot and analyse clusters using boxplot (10 marks)
In [455]:
#Finding optimal no. of clusters using the Elbow plot
from scipy.spatial.distance import cdist      #Used to calculate distances between points
clusters=range(1,10)
meanDistortions=[]

for k in clusters:
    model=KMeans(n_clusters=k)                # K Means
    model.fit(ccdatascaled)                   #Fit the model on scaled data
    prediction=model.predict(ccdatascaled)    # We will use the k means model to predict on the scaled data
    meanDistortions.append(sum(np.min(cdist(ccdatascaled, model.cluster_centers_, 'euclidean'), axis=1)) / ccdatascaled.shape[0])


plt.plot(clusters, meanDistortions, 'bx-')
plt.xlabel('k')
plt.ylabel('Average distortion')
plt.title('Selecting k with the Elbow Method')
Out[455]:
Text(0.5, 1.0, 'Selecting k with the Elbow Method')

The elbow plot is subjective, but can be a useful guide to determing a good number of clusters for k means clustering. It is advised to choose the k value where there is an "elbow" in the plot. This is when the within group sum of squares or variation shifts very little, when the line graph is becoming straighter and has the least amount of variance influence. In this case, it seems like when k = 3, the line graph begins to straighten out horizontally and most of the variation is already captured by 3 clusters. Adding a fourth cluster would not significantly add to explaining the variation of the data. Perhaps we should also explore k = 4, and examine their respective silhouette score.

  1. Execute hierarchical clustering (with different linkages) with the help of dendrogram and cophenetic coeff. Analyse clusters formed using boxplot (15 marks)

  2. Calculate average silhouette score for both methods. (5 marks)

PERFORMING K MEANS, EVALUATING K = 3

In [456]:
# Let us first start with K = 3
final_model=KMeans(3)
final_model.fit(ccdatascaled)
prediction=final_model.predict(ccdatascaled)

#Append the prediction and create a variable called "GROUP", which are the different clusters, for the regular data and the scaled data.
ccdata1["GROUP"] = prediction
ccdatascaled["GROUP"] = prediction
print("Groups Assigned : \n")
ccdata1.head()
Groups Assigned : 

<ipython-input-456-62bf37408435>:7: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  ccdata1["GROUP"] = prediction
Out[456]:
Avg_Credit_Limit Total_Credit_Cards Total_visits_bank Total_visits_online Total_calls_made GROUP
0 100000 2 1 1 0 0
1 50000 3 0 10 9 1
2 50000 7 1 3 4 0
3 30000 5 1 1 4 0
4 100000 6 0 12 3 2

The 3 group/cluster numbers are: 0, 1, and 2.

In [457]:
##Calculating the means of every variable in the unscaled data and organizing it by cluster using groupby

ccdataclust = ccdata1.groupby(['GROUP'])
ccdataclust.mean()
Out[457]:
Avg_Credit_Limit Total_Credit_Cards Total_visits_bank Total_visits_online Total_calls_made
GROUP
0 33782.383420 5.515544 3.489637 0.981865 2.000000
1 12174.107143 2.410714 0.933036 3.553571 6.870536
2 102660.000000 8.740000 0.600000 10.900000 1.080000

Those that are have higher average of average credit limit have a higher average of credit cards and tend to bank online more frequently. Those who are of the middle tier in terms of average credit limit are more likely to visit and/or call the bank than they are to engage in online banking.

In [458]:
#Observe the boxplots for the scaled data and visualize if k means clustering is a good segmentation method for this data. 

ccdatascaled.boxplot(by='GROUP', layout = (2,4),figsize=(15,10))
Out[458]:
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x000002AB30ABE940>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x000002AB30AD8F40>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x000002AB30B04340>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x000002AB30B2E6A0>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x000002AB30B5AA00>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x000002AB30B86D60>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x000002AB30B93D00>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x000002AB30BC7520>]],
      dtype=object)

Additional statistical method to determine independence between clusters: Visually, these boxplots show much overlap when the data is segmented into 3 clusters (k = 3). One method to ascertain statistically significant overlap is to use a statistical test like ANOVA, if you want to see if all 3 clusters are independent of one another when it comes to these 5 variables. The only downside of ANOVA would be that it can only tell if at least 1 cluster is not indpendent of another cluster. It won't tell which cluster is independent or not independent.

Another observation is that there are outliers in particular clusters for certain variables, such as Group 1 and Group 2 in Avg_Credit_Limit and Group 1 in Total_visits_online. These outliers were not present in the original dataset after we handled the outliers for Avg_Credit_Limit. The reason for these is because in K means the centroids are randomly inserted into the multi-dimensional space among the observations. These distances between each observation and the centroids are subject to change depending on where the centroid is randomly placed.

However, calculating the silhouette score is a metric that is also effective for k means clustering analysis.

In [459]:
#Calculate silhouette score
from sklearn.metrics import silhouette_score
k_means_3_score = silhouette_score(ccdatascaled, final_model.labels_, metric='euclidean')
print('Silhouette Score: %.3f' % k_means_3_score)
Silhouette Score: 0.532
In [460]:
#Make a dataframe with this information so that we can merge this into a big table and compare this measure to the others.

silhouette_score_K_Means_3 ={'Metric':['Silhouette Score'], 'K Means(3 Clusters)':[k_means_3_score]}
dataframe1 = pd.DataFrame(silhouette_score_K_Means_3)
dataframe1
Out[460]:
Metric K Means(3 Clusters)
0 Silhouette Score 0.532248

The silhouette score can determine how separated are the clusters. It measures the average of how similar observations in multi-dimensional space are to their own cluster compared to other neighboring clusters.

The silhouette score ranges from -1 to 1. 1 means that the average of all points are closer to their respective cluster than the neighboring cluster(s). -1 means that the average of all points are closer to their neighboring clusters compared to their designated cluster.

For k = 3, the Silhouette score is 0.53, which is decent. Anything well above 0.5 is considered good.

EVALUATING K = 4

In [461]:
# Let's now try K = 4
final_model=KMeans(4)
final_model.fit(ccdatascaled)
prediction=final_model.predict(ccdatascaled)

#Append the prediction to the original data and the scaled data.
ccdata1["GROUP"] = prediction
ccdatascaled["GROUP"] = prediction
print("Groups Assigned : \n")
ccdata1.head()                   #Making sure the new GROUP variable is appended.
Groups Assigned : 

<ipython-input-461-44187c0da297>:7: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  ccdata1["GROUP"] = prediction
Out[461]:
Avg_Credit_Limit Total_Credit_Cards Total_visits_bank Total_visits_online Total_calls_made GROUP
0 100000 2 1 1 0 3
1 50000 3 0 10 9 2
2 50000 7 1 3 4 3
3 30000 5 1 1 4 3
4 100000 6 0 12 3 0
In [462]:
#Calculating the means of every variable in the unscaled data and organizing it by cluster using groupby

ccdataclust4 = ccdata1.groupby(['GROUP'])
ccdataclust4.mean()
Out[462]:
Avg_Credit_Limit Total_Credit_Cards Total_visits_bank Total_visits_online Total_calls_made
GROUP
0 102660.000000 8.740000 0.600000 10.900000 1.080000
1 17266.968326 5.502262 3.719457 1.009050 1.923077
2 12174.107143 2.410714 0.933036 3.553571 6.870536
3 55903.030303 5.533333 3.181818 0.945455 2.103030

For 4 clusters, the group with the highest average credit limit are still most active online. There is a middle group that still favors bank visits and telephone banking calls. However, there are 2 clusters that have the lowest average credit limit. Cluster 2 is more similar to the lower tier group in the 3 clusters analysis. Group 1 is very different, as these individuals have a preference for total banks visits compared to all other means

In [463]:
#Boxplots to show clustering with k = 4

ccdatascaled.boxplot(by='GROUP', layout = (2,4),figsize=(15,10))
Out[463]:
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x000002AB33940580>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x000002AB3394C1F0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x000002AB3397F910>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x000002AB33BFBC70>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x000002AB33C33040>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x000002AB33C5E370>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x000002AB33C6B310>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x000002AB33C92AF0>]],
      dtype=object)

Looking at these boxplots, there seems to be a lot more overlap between the clusters here than when k = 3. We could potentially see a decrease in the silhouette score. Like what was mentioned before, the presence of outliers are attributed where the centroids are located in reference to the observations.

In [464]:
#Calculating silhouette score for k = 4
from sklearn.metrics import silhouette_score
k_means_4_score = silhouette_score(ccdatascaled, final_model.labels_, metric='euclidean')
print('Silhouette Score: %.3f' % k_means_4_score)
Silhouette Score: 0.521
In [465]:
#Make a dataframe with this information so that we can merge this into a big table and compare this measure to the others.

silhouette_score_K_Means_4 ={'Metric':['Silhouette Score'], 'K Means(4 Clusters)':[k_means_4_score]}
dataframe2 = pd.DataFrame(silhouette_score_K_Means_4)
dataframe2
Out[465]:
Metric K Means(4 Clusters)
0 Silhouette Score 0.521261

It seems that k=3 is the better number of clusters for this data when it comes to K Means. The silhouette score decreased by a little (approx. 0.003), which suggests that splitting into a 4th cluster doesn't really contribute much to explaining the variation in the data.

HIERARCHICAL CLUSTERING:

In [536]:
#In order to not confuse ourselves, let us create a new dataframe that eliminates the GROUP variable formed from K means before we attempt hierarchical clustering

print("Unscaled data as it is now")
print(ccdata1.head())                #Unscaled data as it is now
print("")
print("Scaled data as it is now")
print(ccdatascaled.head())           #Scaled data as it is now
print("")

#Create new dataframe with removed GROUP feature/variable
ccdata2=ccdata1.iloc[:,:5] 
ccdatascaled2=ccdatascaled.iloc[:,:5] 



#Viewing the unscaled and scaled dataframes for hierarchical clustering
print("New unscaled dataframe with dropped GROUP variable/feature")
print(ccdata2.head())
print("")
print("New scaled dataframe with dropped GROUP variable/feature")
ccdatascaled2.head()
Unscaled data as it is now
   Avg_Credit_Limit  Total_Credit_Cards  Total_visits_bank  \
0            100000                   2                  1   
1             50000                   3                  0   
2             50000                   7                  1   
3             30000                   5                  1   
4            100000                   6                  0   

   Total_visits_online  Total_calls_made  GROUP  
0                    1                 0      3  
1                   10                 9      2  
2                    3                 4      3  
3                    1                 4      3  
4                   12                 3      0  

Scaled data as it is now
   Avg_Credit_Limit  Total_Credit_Cards  Total_visits_bank  \
0          2.398942           -1.249225          -0.860451   
1          0.643619           -0.787585          -1.473731   
2          0.643619            1.058973          -0.860451   
3         -0.058511            0.135694          -0.860451   
4          2.398942            0.597334          -1.473731   

   Total_visits_online  Total_calls_made  GROUP  labels  
0            -0.547490         -1.251537      3       0  
1             2.520519          1.891859      2       0  
2             0.134290          0.145528      3       2  
3            -0.547490          0.145528      3       2  
4             3.202298         -0.203739      0       1  

New unscaled dataframe with dropped GROUP variable/feature
   Avg_Credit_Limit  Total_Credit_Cards  Total_visits_bank  \
0            100000                   2                  1   
1             50000                   3                  0   
2             50000                   7                  1   
3             30000                   5                  1   
4            100000                   6                  0   

   Total_visits_online  Total_calls_made  
0                    1                 0  
1                   10                 9  
2                    3                 4  
3                    1                 4  
4                   12                 3  

New scaled dataframe with dropped GROUP variable/feature
Out[536]:
Avg_Credit_Limit Total_Credit_Cards Total_visits_bank Total_visits_online Total_calls_made
0 2.398942 -1.249225 -0.860451 -0.547490 -1.251537
1 0.643619 -0.787585 -1.473731 2.520519 1.891859
2 0.643619 1.058973 -0.860451 0.134290 0.145528
3 -0.058511 0.135694 -0.860451 -0.547490 0.145528
4 2.398942 0.597334 -1.473731 3.202298 -0.203739
In [ ]:
 
In [467]:
#Let's use the agglomerative clustering technique

from sklearn.cluster import AgglomerativeClustering 

HIERARCHICAL CLUSTERING USING AVERAGE LINKAGE METHOD FOR 3 CLUSTERS

In [468]:
#We can use the average linkage method and then fit it to the scaled data.
model = AgglomerativeClustering(n_clusters=3, affinity='euclidean',  linkage='average')
model.fit(ccdatascaled2)
Out[468]:
AgglomerativeClustering(linkage='average', n_clusters=3)
In [469]:
#We can also append these as labels to the unscaled and scaled data, like what we did previously when we discussed K means.
ccdata2['labels'] = model.labels_
ccdatascaled2['labels'] = model.labels_
In [470]:
#Checking to see if the labels are appended to the scaled data. We need this to make the boxplots.
ccdatascaled2.head()
Out[470]:
Avg_Credit_Limit Total_Credit_Cards Total_visits_bank Total_visits_online Total_calls_made labels
0 2.398942 -1.249225 -0.860451 -0.547490 -1.251537 2
1 0.643619 -0.787585 -1.473731 2.520519 1.891859 0
2 0.643619 1.058973 -0.860451 0.134290 0.145528 0
3 -0.058511 0.135694 -0.860451 -0.547490 0.145528 0
4 2.398942 0.597334 -1.473731 3.202298 -0.203739 1
In [471]:
#Boxplots to show hierarchical clustering by label (3 clusters) using average link

ccdatascaled2.boxplot(by='labels', layout = (2,4),figsize=(15,10))
Out[471]:
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x000002AB342E6D60>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x000002AB342F7AF0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x000002AB3458B250>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x000002AB345B75B0>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x000002AB345E2910>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x000002AB3460EC70>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x000002AB3461BC10>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x000002AB3464F430>]],
      dtype=object)

There is still a lot of overlap between the different clusters, which suggests that these clusters may not be entirely separated from each other. Like previously, there are some new outliers but these as a result of the hierarchical clustering.

In [472]:
#We can evaluate all the means according to each feature by label/cluster.

ccdatacluster.mean()
Out[472]:
Avg_Credit_Limit Total_Credit_Cards Total_visits_bank Total_visits_online Total_calls_made
labels
0 33713.178295 5.511628 3.485788 0.984496 2.005168
1 141040.000000 8.740000 0.600000 10.900000 1.080000
2 12197.309417 2.403587 0.928251 3.560538 6.883408
In [473]:
#Import libraries to calculate the cophenetic coefficient, to create the dendogram with different linkages

from scipy.cluster.hierarchy import cophenet, dendrogram, linkage
from scipy.spatial.distance import pdist                # This is for pairwise distribution between data points
In [474]:
# calculating cophenetic coefficient
# cophenet index is a measure of the correlation between the distance of points in feature space and distance on dendrogram
# closer it is to 1, the better is the clustering

Z = linkage(ccdatascaled2, metric='euclidean', method='average')
c, coph_dists = cophenet(Z , pdist(ccdatascaled2))

c
Out[474]:
0.8936260063525493
In [475]:
#Summarize into a dataframe so that we can merge this information into a bigger dataframe for cophnetic coefficient comparison between models
Cophenetic_coeff_avg ={'Metric':['Cophenetic Coefficient'], 'Average Linkage CC':[c]}
dataframe11 = pd.DataFrame(Cophenetic_coeff_avg)
dataframe11
Out[475]:
Metric Average Linkage CC
0 Cophenetic Coefficient 0.893626

This cophenetic measure shows high correlation between the euclidean distance between points in the multi-dimensional space and the dendogram distance.

In [476]:
plt.figure(figsize=(10, 5))
plt.title('Agglomerative Hierarchical Clustering Dendogram')
plt.xlabel('sample index')
plt.ylabel('Distance')
dendrogram(Z, leaf_rotation=90.,color_threshold = 40, leaf_font_size=8. )
plt.tight_layout()

The big difference between hierarchical clustering (dendograms) and K means is that for hierarchical clustering, we need to draw out the entire hierarchical tree. What's great about K means is that we can specify the number of clusters we would like. We need to read the dendogram and decide how many clusters we would like. In order to calculate the silhouette score, we need to cut the dendogram according to the last p merged clusters.

In [477]:
#For the sake of comparison to the K means models, we should specify 3 and 4 clusters

#Let's start with specifying p = 3

dendrogram(
    Z,
    truncate_mode='lastp',  # show only the last p merged clusters
    p=3,  # show only the last p merged clusters
)
plt.show()
In [478]:
max_d = 4.2      # There are 3 clusters formed at approximately 3.3
In [479]:
#We need to check if the max_d is an appropriate estimation by looking at the clusters distribution in an array form. 
from scipy.cluster.hierarchy import fcluster
clusters = fcluster(Z, max_d, criterion='distance')
clusters
Out[479]:
array([3, 2, 2, 2, 1, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
      dtype=int32)

After checking this array, we can see there are 1s,2s, and 3s. So, because we see 3 groups, our estimation of max_d is appropriate.

In [480]:
#Let's calculate the silhouette coefficient for 3 clusters using average linkage.
hc_3_clusters_silh_avg = silhouette_score(ccdatascaled2,clusters)
hc_3_clusters_silh_avg
Out[480]:
0.43237007558575374

This silhouette coefficient falls short of decent. 0.5 is normally construed decent.

In [481]:
#Make a dataframe with this information so that we can merge it into a big table and compare this measure to the others.

silhouette_score_hc_avg_3 ={'Metric':['Silhouette Score'], 'Average(3 Clusters)':[hc_3_clusters_silh_avg]}
dataframe3 = pd.DataFrame(silhouette_score_hc_avg_3)
dataframe3
Out[481]:
Metric Average(3 Clusters)
0 Silhouette Score 0.43237

HIERARCHICAL CLUSTERING USING AVERAGE LINKAGE METHOD FOR 4 CLUSTERS

In [482]:
#Let's now evaluate Average Link technique with 4 clusters using boxplots

model = AgglomerativeClustering(n_clusters=4, affinity='euclidean',  linkage='average')
model.fit(ccdatascaled2)
Out[482]:
AgglomerativeClustering(linkage='average', n_clusters=4)
In [483]:
#We can also append these as labels to the unscaled and scaled data (4 clusters).
ccdata2['labels'] = model.labels_
ccdatascaled2['labels'] = model.labels_
In [484]:
#Boxplots to show hierarchical clustering by label (4 clusters)

ccdatascaled2.boxplot(by='labels', layout = (2,4),figsize=(15,10))
Out[484]:
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x000002AB33799D90>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x000002AB337A6FD0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x000002AB337C6670>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x000002AB337E4E50>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x000002AB33804D00>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x000002AB3382EBB0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x000002AB33839B50>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x000002AB3386E040>]],
      dtype=object)

There is still a lot of overlap between the clusters.

In [485]:
#Let's now look at p = 4, and calculate the silhouette coefficient

dendrogram(
    Z,
    truncate_mode='lastp',  # show only the last p merged clusters
    p=4,  # show only the last p merged clusters
)
plt.show()
In [486]:
max_d = 3.3    # After reading the zoomed in dendogram, the dendogram distance seems to be approximately 3.3
In [487]:
#Let's check to see we have 4 different cluster types
from scipy.cluster.hierarchy import fcluster
clusters = fcluster(Z, max_d, criterion='distance')
clusters
Out[487]:
array([4, 3, 2, 2, 1, 3, 1, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
       3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
       3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
       3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
       3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
       3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
       3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
       3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
       3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
       3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
       3, 3, 3, 3, 3, 3, 3, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
      dtype=int32)
In [488]:
hc_4_clusters_silh_avg = silhouette_score(ccdatascaled2,clusters)
hc_4_clusters_silh_avg
Out[488]:
0.5658109575174386
In [489]:
#Make a dataframe with this information so that we can merge this into a big table and compare this measure to the others.

silhouette_score_hc_avg_4 ={'Metric':['Silhouette Score'], 'Average(4 Clusters)':[hc_4_clusters_silh_avg]}
dataframe4 = pd.DataFrame(silhouette_score_hc_avg_4)
dataframe4
Out[489]:
Metric Average(4 Clusters)
0 Silhouette Score 0.565811

The silhouette score went up for 4 clusters compared to 3. This shows that average link hierarchical clustering and k means are 2 entirely different methods of clustering.

HIERARCHICAL CLUSTERING USING COMPLETE LINKAGE METHOD WITH 3 CLUSTERS

In [566]:
#We can use the complete linkage method and then fit it to the scaled data.
model = AgglomerativeClustering(n_clusters=3, affinity='euclidean',  linkage='complete')
model.fit(ccdatascaled2)
Out[566]:
AgglomerativeClustering(linkage='complete', n_clusters=3)
In [567]:
#We can also append these as labels to the unscaled and scaled data (3 clusters).
ccdata2['labels'] = model.labels_
ccdatascaled2['labels'] = model.labels_
In [ ]:
 
In [568]:
#This is made to answer the last question of the assignment
ccdata5 = ccdata2
print(ccdata5.head())
print("")
print("Frequency of all 3 labels")
print(ccdata5['labels'].value_counts())


ccdataclust5 = ccdata5.groupby(['labels'])
ccdataclust5.mean()
   Avg_Credit_Limit  Total_Credit_Cards  Total_visits_bank  \
0            100000                   2                  1   
1             50000                   3                  0   
2             50000                   7                  1   
3             30000                   5                  1   
4            100000                   6                  0   

   Total_visits_online  Total_calls_made  labels  
0                    1                 0       0  
1                   10                 9       2  
2                    3                 4       0  
3                    1                 4       0  
4                   12                 3       1  

Frequency of all 3 labels
0    389
2    221
1     50
Name: labels, dtype: int64
Out[568]:
Avg_Credit_Limit Total_Credit_Cards Total_visits_bank Total_visits_online Total_calls_made
labels
0 33629.820051 5.503856 3.478149 0.997429 2.015424
1 102660.000000 8.740000 0.600000 10.900000 1.080000
2 12149.321267 2.389140 0.918552 3.561086 6.909502

After clustering with the complete linkage method, we can see how many individuals are in each group. Lower (Group 0) and mid-tier (Group 2) levels when it comes to average credit limit seem to make the bulk of this dataset. The group with the highest average credit limit only has 50 customers in it. The other two have the majority of customers.

In [545]:
#Boxplots to show hierarchical clustering by label (3 clusters) using complete link

ccdatascaled2.boxplot(by='labels', layout = (2,4),figsize=(15,10))
Out[545]:
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x000002AB3B329F40>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x000002AB3B3C8FD0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x000002AB3B3FB580>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x000002AB3B6688E0>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x000002AB3B694C40>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x000002AB3B6C0FA0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x000002AB3B6CCF40>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x000002AB3B700760>]],
      dtype=object)

The way how these boxplots are displayed shows an improvement in separation between the clusters, compared to the previous plots. The plot for total_visits online shows only a slight overlap between groups 0 and 2, while group 1 is only overlapping with an outlier from group 0 after clustering. does not show that the clusters are completely separated from one another. Please ignore the group boxplot.

In [546]:
# cophenet index is a measure of the correlation between the distance of points in feature space and distance on dendrogram
# closer it is to 1, the better is the clustering

Z_complete = linkage(ccdatascaled2, metric='euclidean', method='complete')
c_complete, coph_dists = cophenet(Z_complete , pdist(ccdatascaled2))

c_complete
Out[546]:
0.9104796847785294

This is a really high cophenetic correlation coefficient. It suggests that this method is pretty good at clustering and that the dendogram stays faithful to pairwise distances in the data.

In [547]:
#Summarize into a dataframe so that we can merge this information into a bigger dataframe for cophnetic coefficient comparison between models
Cophenetic_coeff_complete ={'Metric':['Cophenetic Coefficient'], 'Complete Linkage CC':[c_complete]}
dataframe12 = pd.DataFrame(Cophenetic_coeff_complete)
dataframe12
Out[547]:
Metric Complete Linkage CC
0 Cophenetic Coefficient 0.91048

The cophenetic coefficient is also very high with the complete method for hierarchical clustering using Euclidean distance. We must now construct the tree dendogram.

In [498]:
plt.figure(figsize=(10, 5))
plt.title('Agglomerative Hierarchical Clustering Dendogram')
plt.xlabel('sample index')
plt.ylabel('Distance')
dendrogram(Z_complete, leaf_rotation=90.,color_threshold = 40, leaf_font_size=8. )
plt.tight_layout()

From this, you can tell that the branching is the same with both methods of agglomerative hierarchical clustering so far. The only major difference is the dendogram distances at which clusters are formed.

In [548]:
#Let's narrow in on the last 3 merged clusters
dendrogram(
    Z_complete,
    truncate_mode='lastp',  # show only the last p merged clusters
    p=3,  # show only the last p merged clusters
)
plt.show()
In [549]:
max_d = 6.3          #From reading the dendogram, we can see the dendogram distance at 3 clusters is about 6.3
In [550]:
#Let's check to see if there are actually 3 cluster types.
from scipy.cluster.hierarchy import fcluster
clusters = fcluster(Z_complete, max_d, criterion='distance')
clusters
Out[550]:
array([3, 2, 3, 3, 1, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
       3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
       3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
       3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
       3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
       3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
       3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
       3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
       3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
       3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
       3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
       3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
       3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
       3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
       3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
       3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
       3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
       3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
      dtype=int32)
In [551]:
#Let's calculate the silhouette coefficient using the complete link hierarchical clustering method for 3 clusters

hc_3_clusters_silh_complete = silhouette_score(ccdatascaled2,clusters)
hc_3_clusters_silh_complete
Out[551]:
0.5811328079630839
In [552]:
#Make a dataframe with this information so that we can merge this into a big table and compare this measure to the others.

silhouette_score_hc_complete_3 ={'Metric':['Silhouette Score'], 'Complete(3 Clusters)':[hc_3_clusters_silh_complete]}
dataframe5 = pd.DataFrame(silhouette_score_hc_complete_3)
dataframe5
Out[552]:
Metric Complete(3 Clusters)
0 Silhouette Score 0.581133
In [553]:
#Let's look at the boxplots for 4 clusters using the complete linkage method.

model = AgglomerativeClustering(n_clusters=4, affinity='euclidean',  linkage='complete')
model.fit(ccdatascaled2)
Out[553]:
AgglomerativeClustering(linkage='complete', n_clusters=4)
In [554]:
#We can also append these as labels to the unscaled and scaled data (4 clusters).
ccdata2['labels'] = model.labels_
ccdatascaled2['labels'] = model.labels_
In [555]:
#Boxplots to show hierarchical clustering by label (4 clusters) using complete link

ccdatascaled2.boxplot(by='labels', layout = (2,4),figsize=(15,10))
Out[555]:
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x000002AB38D6A340>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x000002AB38D975B0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x000002AB38DB69D0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x000002AB38DE2D30>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x000002AB38E1B0D0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x000002AB38E46430>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x000002AB38E533D0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x000002AB38E7CBB0>]],
      dtype=object)

COMPLETE LINKAGE WITH 4 CLUSTERS

In [556]:
#Calcuate the silhouette coefficient
#Let's narrow in on the last 4 merged clusters
dendrogram(
    Z_complete,
    truncate_mode='lastp',  # show only the last p merged clusters
    p=4,  # show only the last p merged clusters
)
plt.show()
In [557]:
max_d = 4.7                 # Distance at which we see 4 clusters surface.
In [558]:
#Let's check to see that there are 4 clusters present
from scipy.cluster.hierarchy import fcluster
clusters = fcluster(Z_complete, max_d, criterion='distance')
clusters
Out[558]:
array([4, 2, 3, 3, 1, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
       3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
       3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
       3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
       3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
       3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
       3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
       3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
       3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
       3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
       3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
       3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
       3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
       3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
       3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
       3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
       3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
       3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
      dtype=int32)
In [559]:
#Silhouette coefficient for 4 clusters using Complete Linkage
hc_4_clusters_silh_complete = silhouette_score(ccdatascaled2,clusters)
hc_4_clusters_silh_complete
Out[559]:
0.5397253688116589
In [560]:
#Make a dataframe with this information so that we can merge this into a big table and compare this measure to the others.

silhouette_score_hc_complete_4 ={'Metric':['Silhouette Score'], 'Complete(4 Clusters)':[hc_4_clusters_silh_complete]}
dataframe6 = pd.DataFrame(silhouette_score_hc_complete_4)
dataframe6
Out[560]:
Metric Complete(4 Clusters)
0 Silhouette Score 0.539725

So far, 4 clusters yields a lower silhouette score compared to 3 clusters using complete linkage methods. It's the opposite for average linkage.

HIERARCHICAL CLUSTERING USING WARD LINKAGE METHOD

In [513]:
# cophenet index is a measure of the correlation between the distance of points in feature space and distance on dendrogram
# closer it is to 1, the better is the clustering

Z_ward = linkage(ccdatascaled2, metric='euclidean', method='ward')
c_ward, coph_dists = cophenet(Z_ward , pdist(ccdatascaled2))

c_ward
Out[513]:
0.8230379628614064
In [366]:
#Summarize into a dataframe so that we can merge this information into a bigger dataframe for cophnetic coefficient comparison between models
Cophenetic_coeff_ward ={'Metric':['Cophenetic Coefficient'], 'Ward Linkage CC':[c_ward]}
dataframe13 = pd.DataFrame(Cophenetic_coeff_ward)
dataframe13
Out[366]:
Metric Ward Linkage CC
0 Cophenetic Coefficient 0.823038

The cophenetic correlation coefficient is quite strong for Ward, but complete linkage dendogram is performing better. Let's now draw the ward dendogram.

In [514]:
plt.figure(figsize=(10, 5))
plt.title('Agglomerative Hierarchical Clustering Dendogram')
plt.xlabel('sample index')
plt.ylabel('Distance')
dendrogram(Z_ward, leaf_rotation=90.,color_threshold=600,  leaf_font_size=10. )
plt.tight_layout()

WARD METHOD 3 CLUSTERS

In [515]:
#Let's look at the boxplots for 3 clusters using the ward linkage method.

model = AgglomerativeClustering(n_clusters=3, affinity='euclidean',  linkage='ward')
model.fit(ccdatascaled2)
Out[515]:
AgglomerativeClustering(n_clusters=3)
In [516]:
#We can also append these as labels to the unscaled and scaled data (3 clusters).
ccdata2['labels'] = model.labels_
ccdatascaled2['labels'] = model.labels_
In [517]:
#Boxplots to show hierarchical clustering by label (3 clusters) using ward link

ccdatascaled2.boxplot(by='labels', layout = (2,4),figsize=(15,10))
Out[517]:
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x000002AB3903EB20>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x000002AB38F9D640>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x000002AB390385B0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x000002AB3906A460>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x000002AB3908E310>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x000002AB390C71C0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x000002AB3A0A2160>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x000002AB3A0C9610>]],
      dtype=object)

All clusters represented by boxplots here are overlapping.

In [372]:
#Let's narrow in on the last 3 merged clusters
dendrogram(
    Z_ward,
    truncate_mode='lastp',  # show only the last p merged clusters
    p=3,  # show only the last p merged clusters
)
plt.show()
In [377]:
max_d = 45      #This is an approximate dendogram distance for 3 clusters.
In [378]:
#Let's check for 3 clusters
from scipy.cluster.hierarchy import fcluster
clusters = fcluster(Z_ward, max_d, criterion='distance')
clusters
Out[378]:
array([3, 1, 3, 3, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
       3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
       3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
       3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
       3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
       3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
       3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
       3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
       3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
       3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
       3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
       3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
       3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
       3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
       3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
       3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
       3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
       3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2],
      dtype=int32)
In [379]:
#Silhouette score for 3 cluster hierarchical clustering using ward linkage
hc_3_clusters_silh_ward = silhouette_score(ccdatascaled2,clusters)
hc_3_clusters_silh_ward
Out[379]:
0.5322599958937628
In [520]:
#Make a dataframe with this information so that we can merge this into a big table and compare this measure to the others.

silhouette_score_hc_ward_3 ={'Metric':['Silhouette Score'], 'Ward(3 Clusters)':[hc_3_clusters_silh_ward]}
dataframe7 = pd.DataFrame(silhouette_score_hc_ward_3)
dataframe7
Out[520]:
Metric Ward(3 Clusters)
0 Silhouette Score 0.53226

This is considered a decent silhouette score.

HIERARCHICAL CLUSTERING USING WARD LINKAGE METHOD FOR 4 CLUSTERS

In [521]:
#Let's look at the boxplots for 4 clusters using the ward linkage method.

model = AgglomerativeClustering(n_clusters=4, affinity='euclidean',  linkage='ward')
model.fit(ccdatascaled2)
Out[521]:
AgglomerativeClustering(n_clusters=4)
In [522]:
#We can also append these as labels to the unscaled and scaled data (4 clusters).
ccdata2['labels'] = model.labels_
ccdatascaled2['labels'] = model.labels_
In [523]:
#Boxplots to show hierarchical clustering by label (4 clusters) using ward link

ccdatascaled2.boxplot(by='labels', layout = (2,4),figsize=(15,10))
Out[523]:
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x000002AB3A6A2A00>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x000002AB3A685D30>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x000002AB3A6D9370>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x000002AB3A9536D0>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x000002AB3A980A30>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x000002AB3A9ACD90>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x000002AB3A9B8D30>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x000002AB3902BFD0>]],
      dtype=object)
In [524]:
#Let's narrow in on the last 4 merged clusters
dendrogram(
    Z_ward,
    truncate_mode='lastp',  # show only the last p merged clusters
    p=4,  # show only the last p merged clusters
)
plt.show()
In [525]:
max_d = 18       #This is an approximate dendogram distance for 4 clusters using the ward method.
In [526]:
from scipy.cluster.hierarchy import fcluster
clusters = fcluster(Z_ward, max_d, criterion='distance')
clusters
Out[526]:
array([3, 1, 3, 4, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
       4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
       4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
       4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
       4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
       4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
       4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
       4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
       3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 3, 3, 3, 3,
       3, 3, 3, 3, 3, 3, 4, 3, 3, 4, 3, 4, 3, 3, 3, 3, 3, 3, 3, 3, 4, 3,
       3, 4, 3, 3, 3, 4, 3, 3, 4, 3, 3, 4, 3, 3, 4, 3, 4, 3, 3, 3, 4, 4,
       3, 3, 3, 3, 3, 4, 3, 3, 3, 3, 3, 4, 3, 3, 3, 4, 3, 3, 3, 4, 3, 4,
       4, 3, 3, 4, 3, 4, 3, 3, 3, 4, 4, 3, 3, 3, 4, 3, 3, 3, 3, 3, 4, 3,
       4, 3, 3, 4, 4, 3, 3, 3, 3, 4, 4, 3, 4, 3, 3, 3, 4, 3, 3, 3, 3, 3,
       4, 3, 4, 3, 3, 3, 3, 3, 4, 3, 3, 3, 3, 3, 3, 3, 3, 4, 3, 3, 3, 4,
       3, 3, 3, 4, 3, 3, 3, 3, 4, 4, 3, 4, 3, 3, 4, 3, 4, 4, 3, 3, 3, 4,
       3, 3, 3, 3, 3, 3, 4, 3, 3, 4, 3, 3, 4, 4, 4, 3, 3, 4, 4, 3, 3, 3,
       3, 4, 3, 4, 3, 3, 3, 3, 4, 3, 4, 3, 3, 3, 4, 3, 3, 4, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2],
      dtype=int32)
In [527]:
#Silhouette score for 4 clusters (ward linkage)
hc_4_clusters_silh_ward = silhouette_score(ccdatascaled2,clusters)
hc_4_clusters_silh_ward
Out[527]:
0.5644569180404536
In [528]:
#Make a dataframe with this information so that we can merge this into a big table and compare this measure to the others.

silhouette_score_hc_ward_4 ={'Metric':['Silhouette Score'], 'Ward(4 Clusters)':[hc_4_clusters_silh_ward]}
dataframe8 = pd.DataFrame(silhouette_score_hc_ward_4)
dataframe8
Out[528]:
Metric Ward(4 Clusters)
0 Silhouette Score 0.564457
  1. Compare K-means clusters with Hierarchical clusters. (5 marks)

COMPARING COPHENETIC COEFFICIENT OF EACH LINKAGE METHOD

In [563]:
#Merging the cophenetic correlation coefficient dataframes for comparison.

Metrics_Dataframe2 = pd.merge(dataframe11,dataframe12,how='outer',on='Metric')
Metrics_Dataframe2 = pd.merge(Metrics_Dataframe2, dataframe13,how='outer',on='Metric')
Metrics_Dataframe2
Out[563]:
Metric Average Linkage CC Complete Linkage CC Ward Linkage CC
0 Cophenetic Coefficient 0.893626 0.91048 0.823038

According to the results, complete linkage provides the highest cophenetic correlation (CPCC = 0.91). This means that of all the models we tested for hierarchical clustering, the dendogram created with complete linkage is the most faithful to the original Euclidean distances between scaled paired observations. This said, the dendogram created with complete linkage is perhaps the most reliable.

COMPARING SILHOUETTE SCORES FOR K MEANS

In [529]:
# Merge K Means dataframes together for comparison of silhouette score comparison

Metrics_Dataframe1 = pd.merge(dataframe1,dataframe2,how='outer',on='Metric')
Metrics_Dataframe1
Out[529]:
Metric K Means(3 Clusters) K Means(4 Clusters)
0 Silhouette Score 0.532248 0.521261

The higher and closer the silhouette score is to 1, the better it is normally. In this case, k = 3 has the higher silhouette score. Due to this, we would prefer to choose k =3 because we want a medium or a sweet spot between being too compressed (when everything is compressed into 1 cluster) and too accurate (when every point is its own cluster). The cost is not worth the benefit when it comes to increasing the amount of clusters beyond 3. The addition of a 4th cluster is not explaining much of the variation in the data.

COMPARING SILHOUETTE SCORES FOR HIERARCHICAL CLUSTERING

In [562]:
# Merge all dataframes hierarchical clustering dataframes for silhouette score comparison

Metrics_Dataframe3 = pd.merge(dataframe3,dataframe4,how='outer',on='Metric')
Metrics_Dataframe3 = pd.merge(Metrics_Dataframe3, dataframe5,how='outer',on='Metric')
Metrics_Dataframe3 = pd.merge(Metrics_Dataframe3, dataframe6,how='outer',on='Metric')
Metrics_Dataframe3 = pd.merge(Metrics_Dataframe3, dataframe7,how='outer',on='Metric')
Metrics_Dataframe3 = pd.merge(Metrics_Dataframe3, dataframe8,how='outer',on='Metric')
Metrics_Dataframe3
Out[562]:
Metric Average(3 Clusters) Average(4 Clusters) Complete(3 Clusters) Complete(4 Clusters) Ward(3 Clusters) Ward(4 Clusters)
0 Silhouette Score 0.43237 0.565811 0.581133 0.539725 0.53226 0.564457

These silhouette scores were calculated for these dendogram models. Normally, we cannot specify the number of clusters beforehand, and we must derive our conclusions about clusters from the visual dendogram itself.

From the above results, 3 clusters had better silhouette coefficients than 4 clusters in only complete linkage, while 4 clusters had better silhouette scores than 3 clusters in both average and ward linkage techniques.

Average linkage in general is influenced by outliers, or high and low numbers because it is reliant on the mean. Complete and ward linkages are robust methods that can handle the presence of outliers. It is important, therefore, to put more confidence in these 2 latter techniques due to their strength to withstand noise. Complete linkage for 3 clusters has a higher silhouette score (+0.04) and is closer to 1 than complete linkage for 4 clusters. (Also, complete linkage had the highest cophenetic correlation coefficient which implies that the dendogram is the most faithful to the Euclidian pairwise distances and is more likely the better dendogram. Please see above.). When it comes to Ward linkage, it can be argued that the addition of a 4th cluster does not add significant explanation in variation for the data and the cost is not worth the benefit.

If we were to compare K means methodology to hierarchical clustering, it would seem that K means would be better because it would lower the computational expense compared to hierarchical clustering. Building a K means model is faster than building a hierarchical model in Python. The big reason is attributed to the number of Euclidian distances that must be computed between observations; K means calculates 3n distances (in this case), while hierarchical dendograms calculates n(n+1)/2 distances. For this reason, hierarchical clusters have a lot of distances to calculate compared to K means. However, hierarchical clustering tends to generate more mathematically intuitive results and have less expected assumptions even if it takes more time to process.

In any case, all Silhouette coefficients are above 0, acknowledging that a decent amount of observations/objects are closer to the points within their own respective cluster than to the other neighboring clusters.

In [ ]:
 
  1. Analysis the clusters formed, tell us how is one cluster different from another and answer all the key questions. (10 marks)

Key Questions:

How many different segments of customers are there? How are these segments different from each other? What are your recommendations to the bank on how to better market to and service these customers?

The clusters are not entirely separated, so we cannot truly conclude that they are independent and separated from each other. However, we can still draw conclusions on the clusters formed from K means and Hierarchical Clustering. K Means and Hierarchical Clustering are both very different processes, so the clusters may yield very different cluster groupings.

To answer this question, we must look back at the unscaled dataset with the appended GROUPS/labels for 3 clusters

In [404]:
#Means of every cluster for each feature using K Means clustering on unscaled data
ccdataclust.mean()
Out[404]:
Avg_Credit_Limit Total_Credit_Cards Total_visits_bank Total_visits_online Total_calls_made
GROUP
0 33782.383420 5.515544 3.489637 0.981865 2.000000
1 12174.107143 2.410714 0.933036 3.553571 6.870536
2 102660.000000 8.740000 0.600000 10.900000 1.080000
In [405]:
#Medians of every cluster for each feature using K Means clustering on unscaled data
ccdataclust.median()
Out[405]:
Avg_Credit_Limit Total_Credit_Cards Total_visits_bank Total_visits_online Total_calls_made
GROUP
0 31000 6 3 1 2
1 12000 2 1 4 7
2 105000 9 1 11 1

Here I put the the tables for the means and medians of each feature grouped by cluster for K means

Group 0 is the middle group. They have a moderate average credit limit. They have an average and median of 6 credit cards. They also seem to prefer human interaction than banking on their own. Of all the clusters, this cluster makes more calls to the bank, and have more bank visits.

Group 1 may be the most financially liable of all the groups. They have on average the lowest average credit limit and own the least amount of credit cards. They tend to call the bank more often than use other methods of banking.

Group 2 seems to have very high average credit limits. They also are the group with the most credit cards. They give the impression that they are trustworthy, pay their bills on time which leads to higher average credit limit, and always get credit card offers. They also accept a lot of credit card offers. They don't visit the bank too often and don't call the bank often and instead, they access the bank's website instead.

I think it would be better if we had more information about the clientele, such as job type, age, and mortgage. These factors could perhaps improve segmentation.

Let's now look at the segmentation done by Hierarchical Clustering, using Complete Linkage (the best modeled dendogram from the analysis)

In [569]:
#Means of every cluster for each feature using Hierarchical clustering (Complete Link) on unscaled data
ccdataclust5.mean()
Out[569]:
Avg_Credit_Limit Total_Credit_Cards Total_visits_bank Total_visits_online Total_calls_made
labels
0 33629.820051 5.503856 3.478149 0.997429 2.015424
1 102660.000000 8.740000 0.600000 10.900000 1.080000
2 12149.321267 2.389140 0.918552 3.561086 6.909502
In [570]:
#Medians of every cluster for each feature using Hierarchical clustering (Complete Link) on unscaled data
ccdataclust5.median()
Out[570]:
Avg_Credit_Limit Total_Credit_Cards Total_visits_bank Total_visits_online Total_calls_made
labels
0 31000 6 3 1 2
1 105000 9 1 11 1
2 12000 2 1 4 7

The groups are labeled differently in hierarchical clustering but the patterns seen from k means is consistent. Group 0 from K means is also Group 0 here. Group 1 from K means is Group 2 here. Group 2 from K means is Group 1 here. Like what we saw in K means, the group with the highest average credit limit spend more time banking online. Those in the middle group spend more time calling the bank and paying visits to the bank. Calling the bank is also the preferred banking method for the group with higher liability.

Recommendations for the bank:

The target group: There seems to be 3 major segmentations among the customers in this banking dataset. Among these 3 clusters, the bank should focus on investing more money into reaching out to the middle tier of individuals. Those that are in the upper bracket have on average 9 credit cards already and may be less inclined to participate in another credit card promotion, despite having a high average credit limit across cards. It may be difficult to upsell to these individuals. This group is also very small in numbers. There are only approximately 50 individuals in this group. Those who belong to the lowest tier when it comes to average credit limit may be a financial liability because they have lower average credit limit, even though they may be more inclined to participate because they have less credit cards.

Advertisements suggestions: The bank should distribute advertisement funds across all 3 platforms but should put more emphasis on in-bank posters, phone commercial breaks while callers are on hold, and banking representatives promoting the deal by phone and in person. Online ads would only for the most part attract individuals of the extremes (the upper and lower tiers). The upper tier consists only a small portion of individuals and the lower tier could be a financial risk. Nonetheless, the payments for online ads should not be halted, as it could possibly attract other new customers with promising prospects.

In [ ]:
 
In [ ]: